Intro

Background

Scotty is a ride-sharing business that operating in several big cities in Turkey. The company provides motorcycles ride-sharing service for Turkey’s citizen, and really value the efficiency in traveling through the traffic–the apps even give some reference to Star Trek “beam me up” in their order buttons.

Scotty provided us with real-time transaction dataset. With this dataset, we are going to help them in solving some forecasting and classification problems in order to improve their business processes.

Problem

Scotty turns out being a very popular service in Turkey! The demands for Scotty began to overload, at some region and some times, and there was not enough driver at those times and places. Fortunately, we are know that we can use classification model to predict which region and times are risky enough to have this “no drivers” problem.

Create a classification model report that would be evaluated on next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). Make prediction that should cover the predicted coverage status for each hour and each area: “sufficient” or “insufficient”.

Our target is to get information which hour and area that have “insufficient” label, it mean we want to know which hour and area that really need more driver.

Import Library

To solve this case we need import some package to help process and predict:

library(GGally)
library(tidyverse)
library(lubridate)
library(scales)
library(rsample)
library(caret)
library(partykit)
# Smote for unbalanced data
library(DMwR)
# KNN modelling
library(class)
#ROCR
library(ROCR)
# Naive Bayes 
library(e1071)
# Visualize
library(plotly)
source("script/matrix_result.R")
source("script/metrics.R")

confussionMatrixPlot <- function(table) {
  table %>% ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), fontface = "bold", color = "white") +
  theme_minimal() +
  theme(legend.position = "none")
}

Read Data

Scotty provide transaction detail transaction from October 1st, 2017 to December 2nd, 2017 as train dataset. Based on this train data we can start process to make model and predict.

train <- read_csv("data/data-train.csv")
## Parsed with column specification:
## cols(
##   id = col_character(),
##   trip_id = col_character(),
##   driver_id = col_character(),
##   rider_id = col_character(),
##   start_time = col_datetime(format = ""),
##   src_lat = col_double(),
##   src_lon = col_double(),
##   src_area = col_character(),
##   src_sub_area = col_character(),
##   dest_lat = col_double(),
##   dest_lon = col_double(),
##   dest_area = col_character(),
##   dest_sub_area = col_character(),
##   distance = col_double(),
##   status = col_character(),
##   confirmed_time_sec = col_double()
## )
test <- read_csv("data/data-test.csv")
## Parsed with column specification:
## cols(
##   src_area = col_character(),
##   datetime = col_datetime(format = ""),
##   coverage = col_logical()
## )

Lets check structure of train data:

glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ src_sub_area       <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e", "sxk9…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ dest_sub_area      <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e", "sxk9…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <chr> "confirmed", "confirmed", "nodrivers", "confirmed"…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…

The dataset includes information about:

Based on data above, we cant see coverage variable inside train data. but we found train data structure contain detail order from scotty. Some variable have mismatch datatype, we will try to change data type:

train <- train %>%
  mutate(src_area = as.factor(src_area),
        src_sub_area = as.factor(src_sub_area),
        dest_area = as.factor(dest_area),
        dest_sub_area = as.factor(dest_sub_area),
        status = as.factor(status)
        )

Data Wrangling

lets preview contain of data to understand more train data:

head(train)
## # A tibble: 6 x 16
##   id    trip_id driver_id rider_id start_time          src_lat src_lon src_area
##   <chr> <chr>   <chr>     <chr>    <dttm>                <dbl>   <dbl> <fct>   
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17    41.1    29.0 sxk9    
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34    40.9    29.1 sxk8    
## 3 59d0… <NA>    <NA>      59cd704… 2017-10-01 00:02:34    41.1    29.0 sxk9    
## 4 59d0… 59d006… 599dc0df… 59bd62c… 2017-10-01 00:03:29    41.0    29.0 sxk9    
## 5 59d0… 59d007… 59a58565… 59c17cd… 2017-10-01 00:04:24    41.1    29.0 sxk9    
## 6 59d0… 59d007… 59579d5f… 594afb1… 2017-10-01 00:05:32    41.0    28.9 sxk9    
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## #   dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## #   confirmed_time_sec <dbl>

we found there’s any NA value inside trip_id and rider_id, to make it more clear i will check NA value from all column before we take any action to data wrangling

summary(is.na(train))
##      id           trip_id        driver_id        rider_id      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:229532    FALSE:214631    FALSE:214632    FALSE:229532   
##                  TRUE :14901     TRUE :14900                    
##  start_time       src_lat         src_lon         src_area      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:229532    FALSE:229532    FALSE:229532    FALSE:229532   
##                                                                 
##  src_sub_area     dest_lat        dest_lon       dest_area      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:229532    FALSE:229532    FALSE:229532    FALSE:229532   
##                                                                 
##  dest_sub_area    distance         status        confirmed_time_sec
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical     
##  FALSE:229532    FALSE:229532    FALSE:229532    FALSE:229532      
## 

We found NA value only trip_id and driver_id, lets compare some row with NA value with non-NA row

train[c(1:3),]
## # A tibble: 3 x 16
##   id    trip_id driver_id rider_id start_time          src_lat src_lon src_area
##   <chr> <chr>   <chr>     <chr>    <dttm>                <dbl>   <dbl> <fct>   
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17    41.1    29.0 sxk9    
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34    40.9    29.1 sxk8    
## 3 59d0… <NA>    <NA>      59cd704… 2017-10-01 00:02:34    41.1    29.0 sxk9    
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## #   dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## #   confirmed_time_sec <dbl>

We found there’s any relation why trip_id and driver_id NA, is because status is “nodrivers”. and it will fullfil if status is “confirmed”. It will be our clue to extract information of coverage data.

Based on our target we have before, we will predict 7 days for each hour and each area. Why we dont specified time level only days, we cant explain detail which time get more dirver order. but if we make level into minute the data will too specific and result the data we have will make us confuse. So hourly interval is best time level to predict.

We can see from structure we dont have hour variable. We have start_time as users order time, we can extract specific level from Date, Days, Hour:

train_hourly <- train %>%
  mutate(src_area = as.factor(src_area),
        date = date(start_time),
        day = as.factor(wday(start_time, label = TRUE, abbr = FALSE)),
        hour = as.factor(hour(start_time)))
glimpse(train_hourly)
## Observations: 229,532
## Variables: 19
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
## $ date               <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 2…
## $ day                <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Su…
## $ hour               <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

we have new variable date it explain date order happen,day it explain day, and hour it explain hours when user order driver.

summary(train_hourly)
##       id              trip_id           driver_id           rider_id        
##  Length:229532      Length:229532      Length:229532      Length:229532     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    start_time                     src_lat         src_lon      src_area     
##  Min.   :2017-10-01 00:00:17   Min.   :40.79   Min.   :28.48   sxk3:  9946  
##  1st Qu.:2017-10-19 19:36:04   1st Qu.:41.00   1st Qu.:28.97   sxk8:  5636  
##  Median :2017-11-07 14:08:48   Median :41.03   Median :29.00   sxk9:213950  
##  Mean   :2017-11-04 18:32:48   Mean   :41.03   Mean   :29.00                
##  3rd Qu.:2017-11-19 03:44:19   3rd Qu.:41.06   3rd Qu.:29.04                
##  Max.   :2017-12-02 23:59:20   Max.   :41.13   Max.   :29.18                
##                                                                             
##   src_sub_area      dest_lat         dest_lon          dest_area     
##  sxk9e  :40869   Min.   : 5.923   Min.   :  0.1273   sxk9   :205449  
##  sxk9s  :24732   1st Qu.:41.001   1st Qu.: 28.9689   sxk3   : 12516  
##  sxk97  :24512   Median :41.036   Median : 29.0036   sxk8   :  4973  
##  sxk9k  :16612   Mean   :41.032   Mean   : 29.0025   sxkc   :  2264  
##  sxk9j  :12336   3rd Qu.:41.064   3rd Qu.: 29.0475   sxkb   :  1994  
##  sxk9h  :12205   Max.   :52.370   Max.   :121.4721   sxkd   :  1911  
##  (Other):98266                                       (Other):   425  
##  dest_sub_area       distance              status       confirmed_time_sec
##  sxk9e  : 35941   Min.   :   0.000   confirmed:214631   Min.   :  0.00    
##  sxk9s  : 25346   1st Qu.:   2.802   nodrivers: 14901   1st Qu.: 12.00    
##  sxk97  : 23031   Median :   4.804                      Median : 26.00    
##  sxk9k  : 13868   Mean   :   6.772                      Mean   : 46.41    
##  sxk9j  : 13704   3rd Qu.:   7.943                      3rd Qu.: 66.00    
##  sxk9h  : 10423   Max.   :9345.566                      Max.   :514.00    
##  (Other):107219                                                           
##       date                   day             hour       
##  Min.   :2017-10-01   Sunday   :21063   18     : 25403  
##  1st Qu.:2017-10-19   Monday   :27425   19     : 20572  
##  Median :2017-11-07   Tuesday  :31190   17     : 19156  
##  Mean   :2017-11-04   Wednesday:31999   16     : 16090  
##  3rd Qu.:2017-11-19   Thursday :32251   8      : 14938  
##  Max.   :2017-12-02   Friday   :50726   15     : 14227  
##                       Saturday :34878   (Other):119146

To make dataset convert perfectly to hourly interval, we need grouping transaction data and summarise data to hourly.

Example we have data with time 06:01:00,06:30:00,07:00:00 so it will group 06:00:00 have 2 transaction and 07:00:00 have 1 transaction.

We will grouping data by src_area, date, hour, and status, for data aggregation. Beside that after we grouping we summarise total_transaction it explain how many transaction within interval, and at status we have 2 class we already know before confirmed and nodrivers, we will count frequency:

coverage_hourly <- train_hourly %>% 
  group_by(src_area, date, day, hour, status) %>% 
  summarise(n = n()) %>%
  ungroup() %>%
  spread(status, n, fill=0) %>% 
  mutate(
    total_transaction = confirmed + nodrivers,
    datetime = ymd_h(paste(date, hour)))

#preview the data
head(coverage_hourly)
## # A tibble: 6 x 8
##   src_area date       day   hour  confirmed nodrivers total_transacti…
##   <fct>    <date>     <ord> <fct>     <dbl>     <dbl>            <dbl>
## 1 sxk3     2017-10-01 Sund… 0             0         1                1
## 2 sxk3     2017-10-01 Sund… 1             1         1                2
## 3 sxk3     2017-10-01 Sund… 3             0         1                1
## 4 sxk3     2017-10-01 Sund… 4             0         1                1
## 5 sxk3     2017-10-01 Sund… 7             0         1                1
## 6 sxk3     2017-10-01 Sund… 8             1         2                3
## # … with 1 more variable: datetime <dttm>

To make prediction model and good exploratory data analysis, we need complete time interval data. lets take look summary of our interval data:

summary(coverage_hourly$datetime)
##                  Min.               1st Qu.                Median 
## "2017-10-01 00:00:00" "2017-10-16 19:00:00" "2017-11-01 17:00:00" 
##                  Mean               3rd Qu.                  Max. 
## "2017-11-01 15:05:47" "2017-11-17 10:00:00" "2017-12-02 23:00:00"

based on summary our data is start from 2017-10-01 00:00:00 to 2017-12-02 23:00:00. We will determine for as start and end of interval

sorted_datetime <- coverage_hourly[order(coverage_hourly$datetime),]

datetime_start <- sorted_datetime$datetime[1]
datetime_end <- sorted_datetime$datetime[length(sorted_datetime$datetime)]

datetime_start
## [1] "2017-10-01 UTC"
datetime_end
## [1] "2017-12-02 23:00:00 UTC"

Make complete list from start interval until end of interval in hourly

all_date <- seq(datetime_start, datetime_end, by = "hour")
all_date <- data.frame(list(datetime = all_date))

# total hours from start to end date
nrow(all_date)
## [1] 1512

Total we have 1512 hour interval from start to the end. we will padding implement new time interval to our train data

sxk3 <- coverage_hourly %>% 
  filter(src_area == "sxk3") %>% 
  arrange(desc(datetime)) %>% 
  merge(all_date, all = T) %>% 
  mutate(src_area = as.factor("sxk3"))

sxk8 <- coverage_hourly %>% 
  filter(src_area == "sxk8") %>% 
  arrange(desc(datetime)) %>% 
  merge(all_date, all = T) %>% 
  mutate(src_area = as.factor("sxk8"))

sxk9 <- coverage_hourly %>% 
  filter(src_area == "sxk9") %>% 
  arrange(desc(datetime)) %>% 
  merge(all_date, all = T) %>% 
  mutate(src_area = as.factor("sxk9"))

coverage_all_date <- do.call("rbind", list(sxk3, sxk8, sxk9))

head(coverage_all_date)
##              datetime src_area       date    day hour confirmed nodrivers
## 1 2017-10-01 00:00:00     sxk3 2017-10-01 Sunday    0         0         1
## 2 2017-10-01 01:00:00     sxk3 2017-10-01 Sunday    1         1         1
## 3 2017-10-01 02:00:00     sxk3       <NA>   <NA> <NA>        NA        NA
## 4 2017-10-01 03:00:00     sxk3 2017-10-01 Sunday    3         0         1
## 5 2017-10-01 04:00:00     sxk3 2017-10-01 Sunday    4         0         1
## 6 2017-10-01 05:00:00     sxk3       <NA>   <NA> <NA>        NA        NA
##   total_transaction
## 1                 1
## 2                 2
## 3                NA
## 4                 1
## 5                 1
## 6                NA

Implement new interval data cause NA value for some row, it because from our train data dont have any transaction during some interval taime. We will fill this NA with 0 value:

coverage_all_date <- coverage_all_date %>% 
  mutate(
        date = date(datetime),
        hour = as.factor(hour(datetime)),
        day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
        confirmed = replace_na(confirmed, 0),
        nodrivers = replace_na(nodrivers,0),
        total_transaction = replace_na(total_transaction ,0),
        )
head(coverage_all_date)
##              datetime src_area       date    day hour confirmed nodrivers
## 1 2017-10-01 00:00:00     sxk3 2017-10-01 Sunday    0         0         1
## 2 2017-10-01 01:00:00     sxk3 2017-10-01 Sunday    1         1         1
## 3 2017-10-01 02:00:00     sxk3 2017-10-01 Sunday    2         0         0
## 4 2017-10-01 03:00:00     sxk3 2017-10-01 Sunday    3         0         1
## 5 2017-10-01 04:00:00     sxk3 2017-10-01 Sunday    4         0         1
## 6 2017-10-01 05:00:00     sxk3 2017-10-01 Sunday    5         0         0
##   total_transaction
## 1                 1
## 2                 2
## 3                 0
## 4                 1
## 5                 1
## 6                 0

Exploratory Data Analysis

Our case is to make prediction that should cover the predicted coverage status for each hour and each area: “sufficient” or “insufficient”. Next we will make our target variable coverage based on variable we have on dataset. status is variable we use labeling coverage, status with nodriver > 0, will label as insufficient and if status nodriver = 0, we will label it sufficient.

coverage_all_date <- coverage_all_date %>%
  mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))

head(coverage_all_date)
##              datetime src_area       date    day hour confirmed nodrivers
## 1 2017-10-01 00:00:00     sxk3 2017-10-01 Sunday    0         0         1
## 2 2017-10-01 01:00:00     sxk3 2017-10-01 Sunday    1         1         1
## 3 2017-10-01 02:00:00     sxk3 2017-10-01 Sunday    2         0         0
## 4 2017-10-01 03:00:00     sxk3 2017-10-01 Sunday    3         0         1
## 5 2017-10-01 04:00:00     sxk3 2017-10-01 Sunday    4         0         1
## 6 2017-10-01 05:00:00     sxk3 2017-10-01 Sunday    5         0         0
##   total_transaction     coverage
## 1                 1 insufficient
## 2                 2 insufficient
## 3                 0   sufficient
## 4                 1 insufficient
## 5                 1 insufficient
## 6                 0   sufficient

Until this step, we have label as target variable is coverage with 2 level “sufficient” and “insufficient” in train data.

Check proportion for our target variable coverage, on process we have prepare some set of data frame. coverage_hourly data before we padding it, and after padding data we have coverage_all_date. let check proportion before padding and after padding.

coverage_hourly <- coverage_hourly %>%
  mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))

proportion <- data.frame(prop.table(table(coverage_hourly$coverage)))

proportion %>% 
  ggplot(aes(x = Var1, y = Freq)) + 
  geom_bar(aes(fill = Freq), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = percent(Freq)),
            fontface = "bold",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  theme(legend.position = "none") +
  labs(
    title = "Coverage proportion before padding",
    x = "",
    y = ""
  )

Before padding proportion of coverage is 64% of data label as insufficient and 36% of data label as sufficient. Next we show train data after implement padding or new time interval

proportion <- data.frame(prop.table(table(coverage_all_date$coverage)))

proportion %>% 
  ggplot(aes(x = Var1, y = Freq)) + 
  geom_bar(aes(fill = Freq), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = percent(Freq)),
            fontface = "bold",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  theme(legend.position = "none") +
  labs(
    title = "Coverage proportion after padding",
    x = "",
    y = ""
  )

Proportion after we padding is more better before we padding, proportion is 54.1& labeled as insufficient and 45.9% labeled as sufficient.

This dataset have src_are explain different area, lets check how is frequency from our train dataset:

proportion <- data.frame(summary(coverage_all_date$src_area)) %>% 
  rownames_to_column(., "VALUE") %>%
  mutate("Source Area" = .[,1],
         "Freq" = .[,2]) %>% 
  select(3,4)

proportion
##   Source Area Freq
## 1        sxk3 1512
## 2        sxk8 1512
## 3        sxk9 1512

it show for each area have same frequency data 1512 transaction, its because we has padding the train dataset before. Lets check proportion of coverage for each area:

proprotion <- coverage_all_date %>% 
  group_by(src_area) %>% 
  summarise(insufficient = sum(coverage == "insufficient"),
            sufficient = sum(coverage == "sufficient"),
            total = n(),
            "proportion insufficient" = insufficient/n(),
            "proportion sufficient" = sufficient/n())

proprotion
## # A tibble: 3 x 6
##   src_area insufficient sufficient total `proportion insuffi… `proportion suffi…
##   <fct>           <int>      <int> <int>                <dbl>              <dbl>
## 1 sxk3              947        565  1512                0.626              0.374
## 2 sxk8              258       1254  1512                0.171              0.829
## 3 sxk9             1249        263  1512                0.826              0.174

More better we visualize it

proprotion %>% 
  select("src_area","proportion insufficient", "proportion sufficient") %>% 
  rename("insufficient" = "proportion insufficient",
         "sufficient" = "proportion sufficient") %>% 
  gather(key = "prop",
         value = "value",
         -src_area) %>% 
  ggplot(aes(x = prop, y = value)) +
  geom_bar(aes(fill = prop), stat = "identity") +
  scale_y_continuous(labels = scales::percent) +
  geom_text(aes(label = percent(value)),
            fontface = "bold",
            color = "white",
            position = position_stack(vjust = 0.5)) +
  facet_wrap( ~ src_area) +
  labs(
    title = "Coverage proportion group by Area",
    x = "",
    y = ""
  ) +
  theme(legend.position = "none") 

From 3 area we have, only sxk8 that have more than 50% sufficient label, it indicate other area sxk3 and sxk9 need more driver in those area. lets check other visualize to get better insight.

We will try to use heatmap type visualization

coverage_all_date %>% 
  mutate(day = as.factor(wday(datetime, label = TRUE, abbr = TRUE))) %>% 
  group_by(day, hour, src_area) %>% 
  summarise(sufficient = sum(coverage == "sufficient")) %>% 
  ggplot(aes(x = day, y = hour, fill = sufficient)) +
  geom_tile() +
  facet_wrap(~src_area) +
   theme(legend.position = "bottom") 

To make more clear and get detail insight, i would make other visualization using bar plot:

coverage_all_date %>% 
  select(-c("datetime", "date", "day")) %>% 
  group_by(hour, src_area) %>% 
  summarise(insufficient = sum(coverage == "insufficient"),
            sufficient = sum(coverage == "sufficient")) %>% 
  gather("type","value",-src_area, -hour) %>% 
  ggplot(aes(x = hour, y = value, fill = type)) +
  geom_col(position = "dodge") +
  facet_grid(src_area ~ .) +
  theme(legend.position = "bottom")  +
  labs(title = "Coverage hourly group by Area")

some insight we can get is:
1. sxk9 area is the most insufficient driver area, from heatmap all days its show dark area it mean all days and every hour users didnt get driver when they order. 2. sxk8 area is the most sufficient driver area, every hour and every hour confirmed driver than other area. 3. However area sxk8 still in 08:00 it nearly insufficient label more than sufficient label, especially in working days from monday to friday, it show more dark blue on heatmap. 4.sxk3 area get more driver start from morning 06:00 which people start to activity until late night around 00:00 and this case happen always every day looks from heatmap. 5. There is corellation between days, hour, area to coverage.

glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id                 <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id            <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id          <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id           <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time         <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat            <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon            <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area           <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area       <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat           <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon           <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area          <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area      <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance           <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status             <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
ggcorr(train, label = T)
## Warning in ggcorr(train, label = T): data in column(s) 'id', 'trip_id',
## 'driver_id', 'rider_id', 'start_time', 'src_area', 'src_sub_area', 'dest_area',
## 'dest_sub_area', 'status' are not numeric and were ignored

summary(coverage_all_date)
##     datetime                   src_area         date                   day     
##  Min.   :2017-10-01 00:00:00   sxk3:1512   Min.   :2017-10-01   Sunday   :648  
##  1st Qu.:2017-10-16 17:45:00   sxk8:1512   1st Qu.:2017-10-16   Monday   :648  
##  Median :2017-11-01 11:30:00   sxk9:1512   Median :2017-11-01   Tuesday  :648  
##  Mean   :2017-11-01 11:30:00               Mean   :2017-11-01   Wednesday:648  
##  3rd Qu.:2017-11-17 05:15:00               3rd Qu.:2017-11-17   Thursday :648  
##  Max.   :2017-12-02 23:00:00               Max.   :2017-12-02   Friday   :648  
##                                                                 Saturday :648  
##       hour        confirmed        nodrivers       total_transaction
##  0      : 189   Min.   :  0.00   Min.   :  0.000   Min.   :   0.00  
##  1      : 189   1st Qu.:  1.00   1st Qu.:  0.000   1st Qu.:   2.00  
##  2      : 189   Median :  6.00   Median :  1.000   Median :   7.00  
##  3      : 189   Mean   : 47.32   Mean   :  3.285   Mean   :  50.60  
##  4      : 189   3rd Qu.: 42.00   3rd Qu.:  3.000   3rd Qu.:  49.25  
##  5      : 189   Max.   :874.00   Max.   :392.000   Max.   :1240.00  
##  (Other):3402                                                       
##          coverage   
##  insufficient:2454  
##  sufficient  :2082  
##                     
##                     
##                     
##                     
## 

Data Preprocessing

Meanwhile after several processing data we has make new data frame coverage_hourly with structure:

glimpse(coverage_hourly)
## Observations: 3,862
## Variables: 9
## $ src_area          <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk…
## $ date              <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 20…
## $ day               <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sun…
## $ hour              <fct> 0, 1, 3, 4, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ confirmed         <dbl> 0, 1, 0, 0, 0, 1, 2, 0, 3, 7, 4, 2, 0, 5, 0, 0, 0, …
## $ nodrivers         <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 3, 1, …
## $ total_transaction <dbl> 1, 2, 1, 1, 1, 3, 3, 1, 4, 8, 4, 3, 1, 5, 1, 3, 1, …
## $ datetime          <dttm> 2017-10-01 00:00:00, 2017-10-01 01:00:00, 2017-10-…
## $ coverage          <fct> insufficient, insufficient, insufficient, insuffici…

some variable in our train dataset still didint use in this new data frame, we can put into our new dataframe

int_var <- train_hourly %>% 
  group_by(src_area, date, day, hour) %>% 
  summarise(mean_distance = mean(distance),
            sum_distance = sum(distance),
            median_distance = median(distance),
            mean_confirmed_time_sec = mean(confirmed_time_sec),
            sum_confirmed_time_sec = sum(confirmed_time_sec),
            median_confirmed_time_sec = median(confirmed_time_sec)) %>% 
  ungroup() %>% 
  select(-c(src_area, date, day, hour))
glimpse(int_var)
## Observations: 3,862
## Variables: 6
## $ mean_distance             <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ sum_distance              <dbl> 1.6709748, 16.8151311, 14.1027795, 3.633148…
## $ median_distance           <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ mean_confirmed_time_sec   <dbl> 0.00000, 22.00000, 0.00000, 0.00000, 0.0000…
## $ sum_confirmed_time_sec    <dbl> 0, 44, 0, 0, 0, 12, 42, 0, 306, 302, 310, 5…
## $ median_confirmed_time_sec <dbl> 0.0, 22.0, 0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 43…

from variable distance and confirmed_time_sec we do mean, median, sum for make new variable. we combine it with coverage_hourly

coverage_hourly <- cbind(coverage_hourly, int_var)

Cross Validation

Cross-validation (CV) is a statistical method that can be used to evaluate the performance of models or algorithms where the data is separated into two subsets namely learning process data and validation / evaluation data. We already have train and test dataset, in this case i will use 3 dataset, is train, validation and test data.

We check proportion our existing train and test:

nrow(train)
## [1] 229532
nrow(test)
## [1] 504

total we have 229.532 data observation and our test data 504 target to predict. The proportion is already well, because we have more 99% data to train our model later.

This train dataset i will seperate into 2 dataset: data train and data validation. I will take 15% from data train to put into data validation.

set.seed(100)
intial_train <- initial_split(coverage_all_date, prop = 0.80, strata = "coverage")
train_training <- training(intial_train)
train_validation <- testing(intial_train)
prop.table(table(train_training$coverage))
## 
## insufficient   sufficient 
##    0.5410468    0.4589532
train_training_upsample <- upSample(x = within(train_training, rm("coverage")), y = train_training$coverage, yname = "coverage")

prop.table(table(train_training_upsample$coverage))
## 
## insufficient   sufficient 
##          0.5          0.5

Modeling

Model Fitting

Before we try to choose model, we have test dataset that give us rules that our model result should fit with our test structure. Lets take look data test we have

glimpse(test)
## Observations: 504
## Variables: 3
## $ src_area <chr> "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk…
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
summary(test)
##    src_area            datetime                   coverage      
##  Length:504         Min.   :2017-12-03 00:00:00   Mode:logical  
##  Class :character   1st Qu.:2017-12-04 17:45:00   NA's:504      
##  Mode  :character   Median :2017-12-06 11:30:00                 
##                     Mean   :2017-12-06 11:30:00                 
##                     3rd Qu.:2017-12-08 05:15:00                 
##                     Max.   :2017-12-09 23:00:00

Data test is contain 3 variable, this data test will be our prediction result, in this case we will try predict next 7 days (Sunday, December 3rd 2017 to Monday, December 9th 2017).

We have src_area it source area that user order driver. We can see datetime interval separate each 1 hour, it mean our prediction value will predict each hour or hourly for 7 days. We have column coverage variable, it will be our target variable contain sufficient and insufficient, based on our bussiness case we would to know insufficient as priority and it will positive.

Our variable to predict using data test strict only based on this 3 variable, for datetime we can extract it into hour interval.

test <- test %>%
  mutate(src_area = as.factor(src_area),
        date = date(datetime),
        day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
        hour = as.factor(hour(datetime)))
glimpse(test)
## Observations: 504
## Variables: 6
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ date     <date> 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03,…
## $ day      <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour     <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…

Based on data test exploration we do above, the only predictor variable we can use only src_area, datetime,date,day,hour and for target still coverage with 2 class sufficient and insufficient. we can removing unsued variable from our data train and data validation we make before

train_training <- train_training %>% 
  select(-c("confirmed","nodrivers","total_transaction","datetime","date"))
train_validation <- train_validation %>% 
  select(-c("confirmed","nodrivers","total_transaction","datetime","date"))
train_training_upsample <- train_training_upsample %>% 
  select(-c("confirmed","nodrivers","total_transaction","datetime","date"))

For models to predict, we only have choice choose model that accept only factor, or categorical predictor variable and categorical target variable. I decided to compare 4 type model:

  1. Logistict Regression
  2. Naive Bayes
  3. Decision Tree
  4. Random Forest

Evaluation

In this case, to compare each model we will use confusion matrix. Confusion matrix is a table that shows four different category: True Positive, True Negative, False Positive, and False Negative.

The performance will be the Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how many of our data is correctly predicted. Sensitivity measures out of all positive outcome, how many are correctly predicted. Specificty measure how many negative outcome is correctly predicted. Precision measures how many of our positive prediction is correct.

Logistict Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).

Logistic regression is a technique that is well suited for examining the relationship between a categorical response variable and one or more categorical or continuous predictor variables. we can check our structure of our dataset, we can see both of predictor or target is factor.

glimpse(train_training_upsample)
## Observations: 3,928
## Variables: 4
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ day      <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour     <fct> 0, 3, 7, 8, 11, 12, 13, 14, 17, 19, 20, 21, 1, 7, 8, 10, 11,…
## $ coverage <fct> insufficient, insufficient, insufficient, insufficient, insu…

Assumption Checking

Logistict regression follow 3 assumtion before we use it:

  • Multicollinearity: between predictors do not correlate with each other, we can see our predictor contain categorical and each dont have corelate each other.

  • Independence of Observations: between observations are mutually independent & do not originate from repeated measurements. We can check ourtraining data actually extraction for hourly and it show historical and it didnt repeat, if repet for time and day it could have different area

  • Linearity of Predictor & Log of Odds: the way of interpretation refers to this assumption. for numerical variables, an increase of 1 value will increase the log of odds.

From total 3 assumption, 2 assumption we confidence our train dataset meet the requirment. So we can continue using our train dataset

Model Train

We will train logistic regression model. at first we will use all our variabel inside train data frame, and use coverage as target.

logistict_model <- glm(coverage ~ ., train_training_upsample, family = "binomial")
summary(logistict_model)
## 
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = train_training_upsample)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.67425  -0.73517   0.00039   0.70405   2.69108  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.30543    0.19411  -1.573 0.115612    
## src_areasxk8  2.45943    0.10611  23.179  < 2e-16 ***
## src_areasxk9 -1.14049    0.09777 -11.665  < 2e-16 ***
## day.L        -0.26384    0.10837  -2.435 0.014912 *  
## day.Q        -0.30119    0.10796  -2.790 0.005274 ** 
## day.C         0.02223    0.10799   0.206 0.836872    
## day^4         0.41647    0.10847   3.840 0.000123 ***
## day^5         0.16530    0.10786   1.533 0.125390    
## day^6        -0.02934    0.10725  -0.274 0.784410    
## hour1         0.10385    0.27392   0.379 0.704611    
## hour2         0.68156    0.26648   2.558 0.010538 *  
## hour3         0.68393    0.26233   2.607 0.009131 ** 
## hour4         1.44160    0.27778   5.190 2.11e-07 ***
## hour5         1.39752    0.27877   5.013 5.35e-07 ***
## hour6         0.99198    0.26675   3.719 0.000200 ***
## hour7        -0.64403    0.28296  -2.276 0.022843 *  
## hour8        -2.09973    0.29788  -7.049 1.80e-12 ***
## hour9        -0.91742    0.27947  -3.283 0.001028 ** 
## hour10        0.11837    0.27001   0.438 0.661097    
## hour11        0.09742    0.27242   0.358 0.720635    
## hour12       -0.38025    0.28216  -1.348 0.177774    
## hour13       -0.69846    0.27990  -2.495 0.012581 *  
## hour14       -0.48210    0.28178  -1.711 0.087100 .  
## hour15       -0.71407    0.28540  -2.502 0.012350 *  
## hour16       -1.05535    0.28809  -3.663 0.000249 ***
## hour17       -1.12399    0.28381  -3.960 7.48e-05 ***
## hour18       -0.99653    0.28472  -3.500 0.000465 ***
## hour19       -1.10776    0.28312  -3.913 9.13e-05 ***
## hour20       -0.04379    0.27489  -0.159 0.873443    
## hour21        0.09773    0.27038   0.361 0.717750    
## hour22        0.69420    0.26831   2.587 0.009672 ** 
## hour23        0.24501    0.26759   0.916 0.359868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5445.4  on 3927  degrees of freedom
## Residual deviance: 3711.9  on 3896  degrees of freedom
## AIC: 3775.9
## 
## Number of Fisher Scoring iterations: 5

We have only 3 predictor variable, but why it show lot of variable? because glm function will automatically do dummy coding means is reoding the original categorical variable into a set of binary variables that have values of one and zero.

From summary we can analyze the fitting and interpret that the model is telling us. We can se average all of our predictor variable are statistivally significant. No matter some dummy coding variable have big p-value or more <0.05 we cant judge one by one we must sum it as one variable.

Model Evaluation

We have make our logistic regression using all predict variable on dataset. Next this model use to predict with valudation dataset.

logistict_prediction <- predict(logistict_model, train_validation, type = "response")

head(logistict_prediction)
##         2         3         5        17        23        30 
## 0.4641338 0.6068290 0.7674631 0.2136755 0.6098418 0.7396008

Logistic regression return value within range of [0,1] and isnt binary class. We will convert probabilty into class using threshold value (by default we will put 0.5 as threshold value).

logistict_prediction <- as.factor(if_else(logistict_prediction >= 0.5, "sufficient", "insufficient"))

head(logistict_prediction)
## [1] insufficient sufficient   sufficient   insufficient sufficient  
## [6] sufficient  
## Levels: insufficient sufficient

As result we have categorical or class prediction, we use confusion matrix to evaluate the model

matrix <- confusionMatrix(logistict_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

logistict_matrix <- matrix_result(matrix, "Logistict Reg.")
logistict_matrix
##            Model Accuracy Sensitivity Specificity Precision
## 1 Logistict Reg.  79.9117    83.26531    75.96154  80.31496

These table explain:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.

The result show our prediction on validation dataset using logistic regression model is 81.6% for accuracy, it mean that our result data prediction 81.6% is correctly classified. Precision/positive predicted value around 80.60%, mean our positive prediction is correctly classified. Value of sensitivity is 86.95% and specificity 75.32%, this indicate positive and negative predict correctly classified range around 75% - 86%.

Naive Bayes

Naive Bayes algorithm, in particular is a logic based technique which is simple yet so powerful that it is often known to outperform complex algorithms for very large datasets.Historically, this technique became popular with applications in email filtering, spam detection, and document categorization. There are certain characteristics of Naive Bayes that should be considered:

  1. Assumes that all features of the dataset are equally important and independent. This allows Naive Bayes to perform faster computation (the algorithms is quite simple).
  2. Prone to bias due to data scarcity. In some cases, our data may have a distribution where scarce observations lead to probabilities approximating close to 0 or 1, which introduces a heavy bias into our model that could lead to poor performance on unseen data.
  3. More appropriate for data with categoric predictors. This is because Naive Bayes is sensitive to data scarcity. Meanwhile, a continuous variable might contain really scarce or even only one observation for certain value.
  4. Apply Laplace estimator/smoothing for data scarcity problem. Laplace estimator proposes the adding of a small number (usually 1) to each of the counts in the frequency table. This subsequently ensures that each class-feature combination has a non-zero probability of occurring.

Although it is often outperformed by other techniques, and despite the naive design and oversimplified assumptions, this classifier can perform well in many complex real-world problems. And since it is a resource efficient algorithm that is fast and scales well, we will implement it to our case.

Model Train

We need build our model, and in this case we also apply Laplace Estimator. At first naive bayes model, we will use all predictor variable we have:

naive_model <- naiveBayes(coverage ~ ., data = train_training_upsample, laplace = 1)

Model Evaluation

Next this model use to predict with valudation dataset, and using confussion matrix

naive_prediction <- predict(naive_model, train_validation)
matrix <- confusionMatrix(naive_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

naive_matrix <- matrix_result(matrix, "Naive Bayes")
naive_matrix
##         Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes  80.3532     85.5102    74.27885  79.65779

These table explain:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.

Model Improvement

There is way to improve our naive bayes, before we do it we will check ROC curve from our naive bayes:

naive_prediction_raw <- as.data.frame(predict(naive_model, train_validation, type = "raw"))

# ROC
naive_roc <- data.frame(prediction = naive_prediction_raw[,2],
                        trueclass = as.numeric(train_validation$coverage=="sufficient"))

naive_roc_pred <- prediction(naive_roc$prediction, naive_roc$trueclass) 

# ROC curve
plot(performance(naive_roc_pred, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1)

The ROC is a curve that shows the performance of the classification model for all thresholds. ROC represents graphics from:

  • Recall / sensitivity (true positive rate) on they axis.
  • 1-specificity (false positive rate) on thex axis.

Next we calculate AUC

# AUC
auc <- performance(naive_roc_pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8481113

AUC is the area under the ROC curve. AUC values indicate the success of the model predicting / differentiating the two classes. The area value has an interval between 0 to 1. If the AUC value is close to 1, it means that the classification model is able to predict / distinguish the two classes well. However, if the AUC value is close to 0.5 it means that the classification model is not able to predict / distinguish the two classes well.
* AUC value close to 1: the classification model is able to predict / distinguish the two classes well
* AUC values close to 0.5: the classification model is not able to predict / distinguish the two classes well.

Based on result ROC and AUC, we get our ROC curce show good separation with AUC score 0.8436585 it mean we have chance to improve this Naive Bayes Model.

In our case we are focus “sufficient” class mean we want predcit when scotty success full fullfill all order from users, it means we are focus on Precision parameter which mean our Naive Bayes already give us good result.

Based on our AUC score we still can improve our model, and one of method to tuning model is change the threshold.

# model tuning - metrics function
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)

# apply function metrics
for(i in 1:100){
  result[i,] = metrics(cutoff = co[i], 
                     prob = naive_prediction_raw$sufficient, 
                     ref = as.factor(ifelse(train_validation$coverage=="sufficient", 1, 0)), 
                     postarget = "1", 
                     negtarget = "0")
}

# visualize
ggplotly(tibble("Recall" = result[,1],
           "Accuracy" = result[,2],
           "Precision" = result[,3],
           "Specificity" = result[,4],
                   "Cutoff" = co) %>% 
  gather(key = "Metrics", value = "value", 1:4) %>% 
  ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
  geom_line(lwd = 1.5) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Tradeoff Model Perfomance") +
  theme_minimal() +
  theme(legend.position = "top",
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank()))

From plot above, we get information break even point is around 0.45. we will set it around 0.49 because we are focusing on Precision parameter, meanwhile we still want other paramaeter get good result.

#Tuning Threshold
naive_prediction_tuning <- naive_prediction_raw %>%
  mutate(label = as.factor(ifelse(sufficient >= 0.49, "sufficient", "insufficient"))) %>% 
  select(label)


naive_matrix_tuning <- confusionMatrix(naive_prediction_tuning$label, train_validation$coverage, positive = "insufficient")
naive_matrix_tuning <- matrix_result(naive_matrix_tuning, "Naive Bayes Tuning")

naive_matrix_tuning
##                Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes Tuning  80.3532    85.10204    74.75962  79.88506

Decision Tree

Decision tree model is one of the tree-based models which has the major benefit of being interpretable. Decision tree is an algorithm that will make a set of rules visualized in a diagram that resembles a tree. There are certain characteristics of decision tree model:

  • Perform well on both numerical and categorical variable.
  • All predictors are assumed to interact.
  • Quite robust to the problem of multicollinearity. A decision tree will choose a variable that has the highest information gain in one split, whereas a method such as logistic regression would have used both.
  • Robust and insensitive to outliers. Splitting will happen at a condition where it maximizes the homogeneity within resulting groups. Outliers will have little influence on the splitting process.

Model Train

let build decision tree model

dtree_model <- ctree(formula = coverage ~., data = train_training_upsample)
plot(dtree_model)

dtree_prediction <- predict(dtree_model, train_validation)
matrix <- confusionMatrix(dtree_prediction, train_validation$coverage)
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

dtree_matrix <- matrix_result(matrix, "Decision Tree")
dtree_matrix
##           Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree  80.7947    87.34694    73.07692  79.25926
predict_result <- predict(dtree_model, test)

Random Forest

ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
randomforest_model <- train(coverage ~ ., data = train_training_upsample, method = "rf", trControl = ctrl, ntree = 100)
randomforest_prediction <- predict(randomforest_model, train_validation)
matrix <- confusionMatrix(randomforest_prediction, train_validation$coverage)
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)

randomforest_matrix <- matrix_result(matrix, "Random Forest")
randomforest_matrix
##           Model Accuracy Sensitivity Specificity Precision
## 1 Random Forest  79.9117     84.4898    74.51923  79.61538
tbl = matrix(data=c(51, 49, 24, 26), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))

chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)
## X-squared           
## 0.1200000 0.7290345
sqrt(chi2$statistic / sum(tbl))
##  X-squared 
## 0.02828427
test$coverage <- predict_result
test <- test %>% 
  mutate(coverage = as.factor(if_else(coverage == 1,"sufficient","insufficient"))) %>% 
  select(-c(date,day,hour))
write.csv(test, "data/data-test-result.csv")
mean_distance <- train_hourly %>% 
  group_by(src_area, date, day, hour) %>% 
  summarise(mean_distance = mean(distance)) %>% 
  ungroup() %>% 
  select(mean_distance)

coverage_hourly <- cbind(coverage_hourly, mean_distance)